What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

Authors

  • Lilian Besson
  • Emilie Kaufmann
Abstract

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the "Doubling Trick". In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results showing that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (minimax) bounds in $R_T = O(\sqrt{T})$ but cannot conserve (distribution-dependent) bounds in $R_T = O(\log T)$. We give insights as to why exponential doubling tricks may be better, as they conserve bounds in $R_T = O(\log T)$ and are close to conserving bounds in $R_T = O(\sqrt{T})$.
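
To make the construction concrete, here is a minimal sketch of a doubling-trick wrapper in Python. The bandit interface (`make_algorithm`, `choose`, `update`) and the reward callables in `arms` are hypothetical, and the exact horizon sequences are one plausible reading of the geometric and exponential families studied in the paper; the paper's precise constants and floors may differ.

```python
def doubling_trick(make_algorithm, arms, total_rounds,
                   T0=10, b=2, a=2, exponential=False):
    """Anytime wrapper around a horizon-dependent bandit algorithm.

    make_algorithm(horizon) is a hypothetical factory returning an object
    with .choose() -> arm index and .update(arm, reward); `arms` is a list
    of zero-argument callables that each draw a stochastic reward.
    Assumed horizon sequences:
      geometric:   T_i = T0 * b**i
      exponential: T_i = T0 * b**(a**i)
    """
    rewards, t, i = [], 0, 0
    while t < total_rounds:
        horizon = int(T0 * b ** (a ** i if exponential else i))
        algo = make_algorithm(horizon)  # fresh restart: past pulls are forgotten
        for _ in range(min(horizon, total_rounds - t)):
            arm = algo.choose()
            reward = arms[arm]()
            algo.update(arm, reward)
            rewards.append(reward)
            t += 1
        i += 1
    return rewards
```

Up to a final horizon T, the geometric sequence triggers roughly log_b(T) restarts while the exponential one triggers only roughly log_a(log_b T); since every restart throws away what was learned, fewer restarts is the intuition behind exponential doubling conserving $O(\log T)$ bounds where geometric doubling cannot.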


Similar Resources

A Generalized Gittins Index for a Class of Multiarmed Bandits with General Resource Requirements

We generalise classical multi-armed and restless bandits to allow for the distribution of a (fixed amount of a) divisible resource among the constituent bandits at each decision point. Bandit activation consumes amounts of the available resource which may vary by bandit and state. Any collection of bandits may be activated at any decision epoch provided they do not consume more resource than is...


Modal Bandits

Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.
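
As a toy illustration of that value notion (not the paper's algorithm), one could score each arm by the empirical mode of its observed rewards instead of the empirical mean; the histogram-based estimator below is a hypothetical sketch.

```python
from collections import Counter

def empirical_mode(samples, bins=10):
    """Estimate the mode of a sample by histogram binning (illustrative only)."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # guard against all-equal samples
    counts = Counter(int((x - lo) / width) for x in samples)
    best_bin = counts.most_common(1)[0][0]
    return lo + (best_bin + 0.5) * width  # bin midpoint as the mode estimate

# The same arm can rank differently under the modal and mean views:
samples = [0.1, 0.1, 0.1, 0.9, 0.95, 1.0, 0.12, 0.11]
print(empirical_mode(samples))       # near 0.1: the most frequent region
print(sum(samples) / len(samples))   # mean ~0.42: a different value signal
```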


Ad Hoc Teamwork Modeled with Multi-armed Bandits: An Extension to Discounted Infinite Rewards

Before deployment, agents designed for multiagent team settings are commonly developed together or are given standardized communication and coordination protocols. However, in many cases this pre-coordination is not possible because the agents do not know what agents they will encounter, resulting in ad hoc team settings. In these problems, the agents must learn to adapt and cooperate with each...


Playing in stochastic environment: from multi-armed bandits to two-player games

Given a zero-sum infinite game, we examine whether players have optimal memoryless deterministic strategies. It turns out that under some general conditions the problem for two-player games can be reduced to the same problem for one-player games, which in turn can be reduced to a simpler related problem for multi-armed bandits. DOI: 10.4230/LIPIcs.FSTTCS.2010.65


Dynamic Ad Allocation: Bandits with Budgets

We consider an application of multi-armed bandits to internet advertising (specifically, to dynamic ad allocation in the pay-per-click model, with uncertainty on the click probabilities). We focus on an important practical issue: advertisers are constrained in how much money they can spend on their ad campaigns. This issue has not been considered in the prior work on bandit-based approaches...




Publication date: 2018